Using deep learning to detect digitally encoded DNA trigger for Trojan malware in Bio-Cyber attacks

This article uses deep learning to safeguard DNA sequencing against bio-cyber attacks. We consider a hybrid attack scenario in which a payload is encoded into a DNA sequence to activate Trojan malware implanted in a software tool used in the sequencing pipeline, allowing the perpetrators to gain control over the resources used in that pipeline during sequence analysis. In this scenario, perpetrators submit synthetically engineered DNA samples that contain the digitally encoded IP address and port number of the perpetrator’s machine. Genetic analysis of the sample’s DNA decodes the address, which the Trojan malware then uses to activate and trigger a remote connection. This approach allows multiple perpetrators to create connections and hijack the DNA sequencing pipeline. To hide the data and avoid detection, the perpetrators can encode the address so as to maximise its similarity with genuine DNA, as we showed previously. In this paper, however, we show how deep learning can be used to detect and identify the encoded trigger data in order to protect a DNA sequencing pipeline from Trojan attacks. The results show a detection accuracy of nearly 100% in this novel Trojan attack scenario, even after fragmentation, encryption and steganography are applied to the encoded trigger data. In addition, the feasibility of designing and synthesizing encoded DNA for such Trojan payloads is validated by a wet lab experiment.


AGATATAAAGTACGACAGTGCTCTCGGCCCTT
AGATATACAGTACTCAATGGATACATCTCCTT
AGATATAGAGTAATCCATATCGAGAGTGCCTT
AGATATATAGTACGTACGACCGAGATGGCCTT
AGATATCAAGTAATGAATCAATGCATAGCCTT
In these sequences, each line corresponds to a fragment of the Trojan payload address (host names and port addresses only). Any encoded line representing a fragment can be inserted, unbroken, at any position inside an existing DNA sequence (our host DNA).
Note, however, that an encoded line cannot be broken up further, since it represents a single fragment.
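The placement rule above (whole fragments only, at any position in the host) can be illustrated with a minimal Python sketch. The function name `insert_fragments`, the toy host string and the chosen positions are hypothetical and are not taken from the article's code; the sketch also assumes that fragments are inserted between host bases rather than replacing them, and that no two fragments share the same insertion point.

```python
# The five encoded fragments listed above; each one is an indivisible unit.
FRAGMENTS = [
    "AGATATAAAGTACGACAGTGCTCTCGGCCCTT",
    "AGATATACAGTACTCAATGGATACATCTCCTT",
    "AGATATAGAGTAATCCATATCGAGAGTGCCTT",
    "AGATATATAGTACGTACGACCGAGATGGCCTT",
    "AGATATCAAGTAATGAATCAATGCATAGCCTT",
]


def insert_fragments(host, fragments, positions):
    """Insert each fragment whole into `host` at the given host coordinate.

    `positions[i]` is the index in the ORIGINAL host before which
    `fragments[i]` is placed. Fragments are never split.
    """
    if len(fragments) != len(positions):
        raise ValueError("need exactly one position per fragment")
    if len(set(positions)) != len(positions):
        raise ValueError("positions must be distinct (assumption of this sketch)")
    for pos in positions:
        if not 0 <= pos <= len(host):
            raise ValueError(f"position {pos} lies outside the host sequence")

    result = host
    # Work from the rightmost position to the leftmost, so that earlier
    # insertions do not shift the host coordinates still to be processed.
    for pos, frag in sorted(zip(positions, fragments), reverse=True):
        result = result[:pos] + frag + result[pos:]
    return result


if __name__ == "__main__":
    toy_host = "ACGT" * 10  # hypothetical 40 bp host, for illustration only
    combined = insert_fragments(toy_host, FRAGMENTS, [0, 10, 20, 30, 40])
    print(len(combined), combined)
```

Because each fragment is spliced in as a single string, it remains contiguous in the result regardless of where it lands in the host.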
Furthermore, any overlap must be managed carefully. To summarize, the content of one file can be placed inside one plasmid, with any line placed at any position (each line is part of either the host name or the port address of one of the different machines that want to form a connection).
NovaBlue cells, as per Mix&Go! kit protocol, and aliquots were spread on pre-warmed LB/Amp (Ampicillin 100 µg/mL) agar plates. A negative control plate was prepared by adding 1 µL of sterile water in place of DNA. Plates were incubated at 37°C overnight. Successfully transformed cells were selected using ampicillin resistance as the selection marker.
Successfully transformed isolated colonies were then inoculated into LB/Amp broth and cultures were incubated until an OD 600nm = 2 was reached. OD 600nm measurements were taken using the NanoDrop™ 1000 (Thermo Scientific™). Cultures were then concentrated to an OD 600nm = 10.
Once cultures reached the appropriate OD 600nm, plasmid DNA was purified using the Monarch® Plasmid Miniprep Kit (NEB) as per the manufacturer's instructions. Plasmid samples were eluted in sterile water, and DNA concentration and quality were assessed using the NanoDrop™ 1000.
The presence of the plasmid for each sample was verified using agarose gel electrophoresis (0.8% agarose made with 1xTAE buffer) (Fig. A.3).

DNA Sequencing
Samples were sequenced by Eurofins Genomics Europe Sequencing GmbH, Germany.

Analysis of sequencing data
Analyses of sequencing data were carried out using a combination of Chromas (v 2.6.6) and MEGA-X (v 10.2.6). Sequencing chromatogram quality was first assessed using Chromas.
Sequence alignments were performed using the CLUSTALW algorithm in MEGA-X. Following successful alignment of each DNA sample sequence with the reference sequence, the sequences were trimmed in Chromas to isolate only the 'Trojan payload applying steganography' DNA and the 'Normal Trojan payload' DNA for analysis (sample sequencing results are shown in Fig. A.4).
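The locate-and-trim step can be illustrated programmatically with a much simpler stand-in. The sketch below uses exact substring matching on both strands; it is not the CLUSTALW alignment or the Chromas trimming used in this work, and it would miss payloads affected by sequencing errors. The function names and the example read are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_payload(read, payload):
    """Find an exact copy of `payload` in `read` on either strand.

    Returns (start, end, strand), with strand '+' or '-', or None when
    no exact match exists.
    """
    read, payload = read.upper(), payload.upper()
    start = read.find(payload)
    if start != -1:
        return start, start + len(payload), "+"
    start = read.find(reverse_complement(payload))
    if start != -1:
        return start, start + len(payload), "-"
    return None


def trim_to_payload(read, payload):
    """Trim `read` down to the payload region, reported on the '+' strand."""
    hit = locate_payload(read, payload)
    if hit is None:
        return None
    start, end, strand = hit
    region = read[start:end].upper()
    return region if strand == "+" else reverse_complement(region)


if __name__ == "__main__":
    payload = "AGATATAAAGTACGACAGTGCTCTCGGCCCTT"
    read = "GGGG" + payload + "TTTT"  # hypothetical read, for illustration only
    print(locate_payload(read, payload))
    print(trim_to_payload(read, payload))
```

Checking the reverse complement as well covers reads that originate from the opposite strand of the plasmid.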

Fig. A.4: Sample sequencing chromatogram from pNOSTEG, showing the 60 bp DNA sequence region for the Trojan payload address without encryption and steganography applied.

Author Contribution
Mr. Mohd Siblee Islam is the primary author of the article. Mr. Islam was responsible for developing the software code used to perform the computational experiments, executing the experiments, analysing and interpreting the results presented in this article, and writing the manuscript. Dr. Witty Sri-saan was the scientific driver behind the DNN analysis of the DNA strands with the injected code, as well as the development of the hacking scenarios.

Dr. Stepan Ivanov
Dr. Hamdan Awan was responsible for the analysis of the data in the results section, in particular the analysis of performance under variations in parameters.

Data Availability Statement
All data used in the manuscript are available in the Addgene repository (https://www.addgene.org/), from which the plasmid DNA sequences of E. coli bacteria used in our experiments were collected by web scraping. These data are also available as a supplementary document (all_plasmid_dna.txt). The code developed to conduct the experiments, including the scripts for data collection from Addgene, is freely available in a public Git repository at https://github.com/sibleeislam/trojan-malware-in-bio-cyber-attacks. For any further queries related to data availability, please contact the primary author of the manuscript by email (sibleeislam@gmail.com).